Eff-3DPSeg: 3D Organ-Level Plant Shoot Segmentation Using Annotation-Efficient Deep Learning

Reliable and automated 3-dimensional (3D) plant shoot segmentation is a core prerequisite for the extraction of plant phenotypic traits at the organ level. Combining deep learning and point clouds provides an effective way to address this challenge. However, fully supervised deep learning methods require point-wise annotated datasets, which are extremely expensive and time-consuming to produce. In this work, we proposed a novel weakly supervised framework, Eff-3DPSeg, for 3D plant shoot segmentation. First, high-resolution point clouds of soybean were reconstructed using a low-cost photogrammetry system, and the MeshLab-based Plant Annotator was developed for plant point cloud annotation. Second, a weakly supervised deep learning method was proposed for plant organ segmentation. The method consists of (a) pretraining a self-supervised network using the Viewpoint Bottleneck loss to learn meaningful intrinsic structure representations from the raw point clouds and (b) fine-tuning the pretrained model with only about 0.5% of points annotated to implement plant organ segmentation. Afterward, 3 phenotypic traits (stem diameter, leaf width, and leaf length) were extracted. To test the generality of the proposed method, the public dataset Pheno4D was included in this study. Experimental results showed that the weakly supervised network achieved segmentation performance similar to that of the fully supervised setting. Our method achieved 95.1%, 96.6%, 95.8%, and 92.2% in precision, recall, F1 score, and mIoU for stem-leaf segmentation on the soybean dataset and 53%, 62.8%, and 70.3% in AP, AP@25, and AP@50 for leaf instance segmentation on the Pheno4D dataset. This study provides an effective way to characterize 3D plant architecture, which will become useful for plant breeders to enhance selection processes. The trained networks are available at https://github.com/jieyi-one/EFF-3DPSEG.


Introduction
High-throughput plant phenotyping is crucial for improving the understanding of the interactions between plant genotypes and phenotypes, which can be highly useful for speeding up the selection of desired genotypes [1]. Traditional manual methods for plant phenotyping are highly labor-intensive, time-consuming, and prone to inaccuracy [2]. High-throughput plant phenotyping technologies have been identified as a bottleneck limiting systematic studies of plant gene functions and plant multi-omics research [3]. Recently, computer vision-based methods have been gaining increasing attention among plant researchers for efficiently visualizing plant architecture, measuring phenotypic traits, and reducing human error.
Two-dimensional (2D) image-based methods have been widely used for high-throughput plant phenotyping during the last several decades [4]. For example, existing studies have demonstrated the extraction of multiple phenotypic traits from 2D images for a wide range of crops, such as tomato, maize, sorghum, and wheat [5][6][7]. However, these 2D imaging technologies have some obvious drawbacks: (a) it is difficult to address occlusion issues due to the lack of depth information, and (b) it is difficult to determine object structure information [8].
To address the disadvantages of 2D image-based methods, much effort has been made in the development of 3D imaging systems for plant phenotyping in the past decade [9]. Compared with 2D data, 3D data not only largely address the aforementioned limitations but also provide opportunities to extract new and more complex phenotypic traits by generating accurate coordinates and distance estimates of objects [10]. At present, rapid 3D plant data acquisition benefits from the development of sensing technology and improvements in computational performance [11,12]. For 3D plant data acquisition, light detection and ranging (LiDAR) [11,13], time of flight (ToF) [14], depth cameras [15], and multi-view stereo (MVS) cameras [16,17] are widely used for plant 3D reconstruction and phenotypic analysis.
After acquiring precise plant point clouds, reliable and automated plant organ segmentation becomes a prerequisite for phenotypic analysis. Many existing studies developed traditional computer vision methods for plant organ segmentation from 3D data, such as threshold-based methods [18], geometry-based methods [19], octree-based methods [20], and 3D skeleton-based methods [21]. These methods can handle several types of plants with simple structures through tedious and labor-intensive parameter tuning, which makes them unsuitable for the big-data processing requirements of high-throughput plant phenotyping [22].
Recently, there has been rapid growth in deep learning-based methods, which can improve the generality and accuracy of plant instance segmentation. For example, Shi et al. [23] applied a fully convolutional network (FCN) and a Mask R-CNN for plant semantic and instance segmentation on multi-view images and implemented 3D plant instance segmentation by projecting the segmentation results from 2D to 3D. Li et al. [24] developed a network named DeepSeg3DMaize for plant point cloud segmentation based on the PointNet model, integrating high-throughput data acquisition and deep learning. PlantNet [25], a dual-function point-based 3D deep learning network, simultaneously implemented stem-leaf semantic and instance segmentation for 3 different crop species. DeepSeg3DMaize and PlantNet are point-based networks that can directly process the input point clouds but suffer from dynamic kernel computation overhead due to their irregular memory access patterns. Jin et al. [26] proposed a voxel-based deep learning network to realize maize semantic classification and leaf instance segmentation. Voxel-based methods can process point clouds using 3D volumetric convolutions, which is good for local context modeling, but they require a high voxelization resolution in order not to lose information. However, training such deep learning models is still challenging. First, all these 3D networks are fully supervised deep learning methods that need point-wise annotated datasets for model training. Point-wise annotation of large-scale 3D plant point clouds is very time-consuming, and a user-friendly toolkit for annotation is currently absent. To address the shortage of training data, one potential solution is to develop deep learning architectures whose learning models can harvest useful information from the raw plant point clouds and the tiny set of annotations. This is usually referred to as weakly supervised representation learning. For example, PointContrast [27] proposed a PointInfoNCE loss and verified its effectiveness on a set of weakly supervised 3D scene understanding tasks. Contrastive Scene Context (CSC) [28] introduced a loss function that contrasts features aggregated in local partitions. In the plant domain, Wu and Xu [29] proposed a method for crop organ segmentation and disease recognition on 2D images based on a weakly supervised deep convolutional neural network (DCNN) and a lightweight model. Zhou et al. [30] explored the possibility of weakly supervised models for disease spot segmentation, trained using image-level annotations to reduce the cost of annotation work. While there have been several works on weakly supervised plant organ segmentation on 2D images, there is a paucity of research using 3D point clouds for plant instance segmentation.
Second, there are few large-scale well-labeled plant point cloud datasets, and there is no universal benchmark dataset for plant organ instance segmentation. Pheno4D [31] is a spatiotemporal point cloud dataset but only includes 7 tomato and 7 maize plants. ROSE-X [32] provided an annotated 3D dataset of rosebush plants for training and evaluating organ segmentation methods; however, it only contains 11 annotated 3D plant models with organ labels for voxels corresponding to the plant shoot. Large, high-quality raw 3D plant datasets are beneficial for training deep learning models to obtain better results. Thus, building a large-scale well-labeled 3D plant point cloud dataset is the key to deep learning-based high-throughput plant phenotyping. In this work, we proposed a weakly supervised deep learning-based framework, Eff-3DPSeg, for both plant stem-leaf segmentation and leaf instance segmentation. To do so, a low-cost Multi-view Stereo Pheno Platform (MVSP2) was developed to acquire point clouds of individual plants, and then a point cloud annotation tool, MeshLab-based Plant Annotator (MPA), was used for the data annotation. After that, a weakly supervised deep learning network was developed for end-to-end 3D plant architecture segmentation. Finally, 3 plant phenotypic traits were extracted based on the segmentation results. This work is an extension of our previous study, Eff-PlantNet [33]. To summarize, the main contributions are as follows:
1. We proposed a weakly supervised 3D plant shoot segmentation framework, Eff-3DPSeg, which first learns meaningful representations from data without any annotation through a voxel-based self-supervised learning method and then fine-tunes the pretrained network with weakly supervised points (about 0.5% of points being labeled). This learning strategy can produce better performance than training models with weak supervision directly [27,28]. It is a robust deep learning-based method that can be reapplied to other plant species with minor modifications.
2. We built a large-scale well-labeled soybean spatiotemporal dataset, which includes point clouds at different growth stages over 3 weeks together with organ-level annotations.
3. We demonstrated the effectiveness of the weakly supervised plant organ segmentation method by comparing its segmentation performance with the fully supervised method on 2 types of plants: soybean and tomato.

Overview
Overall, the proposed Eff-3DPSeg framework consists of 3 parts (Fig. 1). The first part (Fig. 1A) is for the high-resolution plant point cloud acquisition and annotation. The plant point clouds were reconstructed using the low-cost photogrammetry platform MVSP2, and the point-wise annotation of point clouds was labeled by the tool MPA. The second part (Fig. 1B) is to apply the proposed weakly supervised deep learning networks for 3D plant shoot segmentation, including plant stem-leaf segmentation and leaf instance segmentation. The third part (Fig. 1C) is the plant phenotypic trait extraction. Utilizing the results of plant organ segmentation, we extracted 3 plant phenotypic traits, including stem diameter, leaf width, and leaf length.

Soybean point cloud acquisition
The platform MVSP2 mainly consists of a red-green-blue (RGB) camera (LUMIX DMC-G7W, Panasonic, Japan), a turntable, and light-emitting diode (LED) lights. Soybean seeds were planted in pots, which were kept in a growth chamber until the seeds had germinated and the unifoliate leaves had unfolded. Then, the plants were moved from the growth chamber to an indoor environment at room temperature (23°C) and lighted with the LED lights. Representative point clouds of a soybean plant from 2022 May 6 to 27 are shown in Fig. 2.

Soybean point cloud annotation
We developed the point cloud annotation tool MPA for point cloud annotation. In the process, we selected the points of an individual organ using the built-in function "select vertex cluster" in MeshLab; we then assigned a predefined label to the selected points, which is indicated by a unique color. After annotation, the original point cloud and labels can be exported from MeshLab. We predefined a total of 70 categories in the tool, which is enough for organ instance labeling of a soybean plant at the different growth stages in this study. We labeled each point as "stem" or "leaf", and each leaf had its own unique ID label, differentiating it from the other leaves in the same plant point cloud (Fig. 3). In this study, we annotated 145 plant samples in total.

Pheno4D dataset
To test the segmentation performance of the proposed 3D deep learning networks, we included another public plant point cloud dataset, Pheno4D [31], which contains tomato and maize plants. In this study, the tomato point clouds in Pheno4D were used because they have more complex structures than the maize plants. The dataset contains 7 tomato plants scanned on 20 different days, yielding 140 point clouds, 77 of which are annotated. The points were labeled into 3 categories: "soil", "stem", and "leaf", where each leaf was annotated with a unique label, making it distinct from the other leaves on the same plant. Table 1 shows a comparison between our dataset and the Pheno4D dataset.

Generation of annotation-efficient dataset
In this study, we explored 3D plant organ segmentation with a limited budget for plant point cloud annotation, and we call the resulting datasets annotation-efficient datasets. We tried 3 different labeling settings, i.e., annotating 50, 100, and 200 points of each point cloud for network training. To generate the annotation-efficient dataset, each plant point cloud was first down-sampled with a ratio factor of 0.2. Then, we randomly chose 50, 100, or 200 points in each down-sampled point cloud, kept their original labels, and set the labels of all other points to "None". To reduce the amount of computation and focus on the plant organs, we deleted the "soil" points in Pheno4D.

Fig. 5. Overview of the Viewpoint Bottleneck (VIB) pretraining. Given an input point cloud of M points, 2 random geometric transformations produce its 2 augmentations X_p and X_q. They are fed to the shared Sparse ConvUnet f_θ to obtain 2 high-dimensional representation sets Z_p and Z_q (M × D, where D is the number of representation dimensions). To keep computation tractable, farthest point sampling (FPS) is applied to the representations to obtain the down-sampled representations Z'_p and Z'_q (H × D, where H is the number of points after down-sampling). Finally, VIB is imposed on the cross-correlation matrix between Z'_p and Z'_q, denoted as Z.

Fig. 6. Architecture of Sparse ConvUnet: "×" indicates a hypercubic kernel, and "+" indicates a hypercross kernel [36].
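To make the annotation-efficient labeling described above concrete, the following is a minimal sketch of how one weakly labeled training sample could be generated; it assumes the point cloud is a numpy array with one integer label per point, and the function name, random down-sampling (the paper does not specify the down-sampling method), and the sentinel value -1 for "None" labels are our own illustrative choices.

```python
import numpy as np

def make_weak_labels(points, labels, num_keep=200, down_ratio=0.2, seed=0):
    """Down-sample a labeled point cloud and keep labels for only a few points.

    points: (N, 3) or (N, 6) array of coordinates (and optional RGB).
    labels: (N,) integer organ labels.
    Returns the down-sampled points and a weak label array in which all but
    `num_keep` randomly chosen points are set to -1 (treated as "None").
    """
    rng = np.random.default_rng(seed)
    n = len(points)
    # down-sample with a ratio factor of 0.2 (random sampling as a stand-in)
    keep = rng.choice(n, size=int(n * down_ratio), replace=False)
    pts, lbl = points[keep], labels[keep]
    # keep the original labels for num_keep random points; mask out the rest
    weak = np.full(len(pts), -1, dtype=np.int64)
    chosen = rng.choice(len(pts), size=min(num_keep, len(pts)), replace=False)
    weak[chosen] = lbl[chosen]
    return pts, weak
```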

Overview of the proposed method
There are 2 main steps in the proposed weakly supervised plant organ segmentation method (Fig. 4). First, a backbone network is pretrained using a 3D self-supervised representation learning method, Viewpoint Bottleneck (VIB) [34,35]. Then, the pretrained model is modified by adding a semantic segmentation head and an instance segmentation head and is fine-tuned using the weakly annotated point clouds to implement the plant stem-leaf segmentation and leaf instance segmentation.

Self-supervised pretraining
The key to effective weakly supervised learning is leveraging the numerous unlabeled points in the plant point clouds. To do so, we applied a self-supervised representation learning method, VIB, to learn meaningful representations from the plant point clouds without relying on any annotations. As shown in Fig. 5, a Sparse ConvUnet (Fig. 6) was used as the backbone to extract point features. This backbone provides discriminative voxel-based features for subsequent processing. In addition, it has a relatively small GPU memory footprint, which suits high-resolution plant 3D data well: large numbers of points can be processed simultaneously, and a deeper network can be built to learn richer representations of the plant point clouds, such as contextual and geometric information. Therefore, Sparse ConvUnet can reduce the computational cost and memory requirements while maintaining high accuracy. It was implemented using MinkowskiEngine [36], an open-source auto-differentiation library for sparse tensors that implements generalized sparse convolution. In contrast, point-based feature extraction schemes such as PointNet++ require down-sampling of high-resolution plant point clouds to avoid high computational cost; meanwhile, such down-sampling is sensitive to the local point cloud density and cannot obtain a good abstraction of the point clouds.
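As an illustration of this voxel-based pipeline, the sketch below quantizes a raw point cloud into a MinkowskiEngine sparse tensor and passes it through a single generalized sparse convolution; the 2-mm voxel size, channel widths, and single-layer stand-in for the full Sparse ConvUnet are illustrative assumptions, not the paper's configuration.

```python
import torch
import MinkowskiEngine as ME

# toy plant point cloud: coordinates in meters, RGB features in [0, 1]
coords = torch.rand(100_000, 3)
colors = torch.rand(100_000, 3)

# quantize to a sparse voxel grid (an illustrative 2-mm voxel size)
voxel_size = 0.002
quant_coords, quant_feats = ME.utils.sparse_quantize(
    coordinates=coords / voxel_size, features=colors)

# add the batch index expected by MinkowskiEngine, then build a sparse tensor
batched_coords = ME.utils.batched_coordinates([quant_coords])
x = ME.SparseTensor(features=quant_feats, coordinates=batched_coords)

# one generalized sparse 3D convolution as a stand-in for the full
# Sparse ConvUnet backbone (an encoder-decoder with skip connections)
conv = ME.MinkowskiConvolution(in_channels=3, out_channels=32,
                               kernel_size=3, dimension=3)
out = conv(x)
print(out.F.shape)  # per-voxel 32-dimensional features
```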
The network was trained with the VIB loss (Eq. 1):

$$\mathcal{L}_{\mathrm{VIB}} = \sum_{i}\left(1 - Z_{ii}\right)^{2} + \lambda \sum_{i}\sum_{j \neq i} Z_{ij}^{2} \tag{1}$$

where λ is a positive constant trading off the 2 terms of the loss function. The VIB loss aims to push the diagonal elements Z_ii to 1 and the off-diagonal elements Z_ij to 0. In this way, it maximizes the correlation between corresponding feature channels while decorrelating different feature channels. As shown in Fig. 5 (right), 5 vectors of different colors demonstrate the sampled representations of point clouds. VIB operates on the feature dimension to correlate the corresponding channels and decorrelate the different channels.
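Eq. 1 can be implemented in a few lines; the PyTorch sketch below is our reading of the loss, not the authors' released code, and the λ value and per-channel standardization (so that the diagonal of Z measures correlation) are assumptions.

```python
import torch

def vib_loss(z_p, z_q, lam=0.005):
    """Viewpoint Bottleneck loss on two (H, D) down-sampled representations.

    Standardizes each of the D feature channels over the H sampled points,
    forms the D x D cross-correlation matrix Z, then pushes its diagonal
    toward 1 and its off-diagonal entries toward 0 (Eq. 1).
    """
    z_p = (z_p - z_p.mean(dim=0)) / (z_p.std(dim=0) + 1e-6)
    z_q = (z_q - z_q.mean(dim=0)) / (z_q.std(dim=0) + 1e-6)
    h = z_p.shape[0]
    z = (z_p.T @ z_q) / h                          # cross-correlation matrix Z
    on_diag = (1.0 - torch.diagonal(z)).pow(2).sum()
    off_diag = (z - torch.diag(torch.diagonal(z))).pow(2).sum()
    return on_diag + lam * off_diag
```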
Overall, through the self-supervised pretraining (Fig. 4A), a pretrained model was learned, which contained meaningful representations that leveraged the intrinsic structure between enormous unlabeled points in the plant point clouds.

Weakly supervised plant organ segmentation
After pretraining, we fine-tuned the pretrained model by adding a stem-leaf segmentation head (orange layer in Fig. 4B) and a leaf instance segmentation head (yellow layer in Fig. 4B) to implement stem-leaf segmentation and leaf instance segmentation, respectively, with the weakly annotated point clouds. In our experiments, we had 3 weakly supervised training settings (50, 100, and 200 points). In Fig. 4B, the sparse points in different colors represent the weak annotations of the plant point cloud. This training scheme performs much better than directly training with annotation-efficient point clouds, because the feature representations and the intrinsic structure among the numerous unlabeled points are fully leveraged by self-supervised pretraining. This meaningful representation information benefits the weakly supervised learning. In our implementation (Fig. 7), the backbone network provides discriminative point-wise features F for the subsequent processing.
Plant stem-leaf segmentation: We utilized a multi-layer perceptron (MLP) to produce stem-leaf semantic scores (M × n) from the point features F for the M points over the n classes. The predicted stem-leaf label of each point was then obtained through an argmax operation. The plant stem-leaf segmentation Sparse ConvUnet was optimized with a cross-entropy loss.
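A minimal sketch of this head is shown below: an MLP maps the backbone features F to per-point class scores, and the cross-entropy loss simply ignores unlabeled points (encoded here with the assumed sentinel -1), which is how the weak annotations enter training; the feature width of 96 and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 2                                  # "stem" and "leaf"

# stem-leaf segmentation head: point features F (M, D) -> scores (M, n)
sem_head = nn.Sequential(
    nn.Linear(96, 96), nn.BatchNorm1d(96), nn.ReLU(),
    nn.Linear(96, NUM_CLASSES))

# cross-entropy over the few annotated points only; -1 marks "None" labels
criterion = nn.CrossEntropyLoss(ignore_index=-1)

features = torch.randn(10_000, 96)               # backbone output F (toy)
weak_labels = torch.full((10_000,), -1, dtype=torch.long)
weak_labels[torch.randperm(10_000)[:200]] = torch.randint(
    0, NUM_CLASSES, (200,))

scores = sem_head(features)                      # (M, n) semantic scores
loss = criterion(scores, weak_labels)            # supervised by 200 points
pred = scores.argmax(dim=1)                      # per-point stem-leaf labels
```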
Plant leaf instance segmentation: The leaf instance segmentation refers to the task of assigning to every point not only a semantic label but also an instance ID for each leaf. In our implementation, we fed the point features F into 2 branches. One branch is the same as the stem-leaf segmentation head, producing stem-leaf semantic labels that are used to select "leaf" points for individual leaf clustering. The other branch, called the offset branch, predicts a point-wise offset vector O that shifts the original coordinates Coords toward the shifted coordinates Shift Coords, bringing each point to its respective ground-truth instance centroid. In this way, the points from the same instance are directed toward a common centroid, bringing them closer together. The offset module was implemented with 2 sparse convolutional layers and a batch normalization layer. We used a clustering method [37] to group points into candidate clusters based on the stem-leaf semantic labels S and the dual coordinate sets, the original Coords and the Shift Coords, which produced C_c and C_s, respectively. In the clustering method, we used a 1.5-mm ball as the threshold for every point to find its neighboring points; the threshold was selected based on the distances between points in the soybean and tomato point clouds. Within the ball, points are grouped into one individual leaf when they have the "leaf" label. The choice of the ball radius is affected by the point density of the point clouds. Last, we obtained the final clustering results C as the union of C_c and C_s [36]. We trained the whole leaf instance segmentation network with a voting center loss consisting of 3 parts [36] (Eq. 2):

$$L = L_{\mathrm{sem}} + L_{\mathrm{o\text{-}reg}} + L_{\mathrm{o\text{-}dir}} \tag{2}$$

where L_sem is a cross-entropy loss and L_o-reg and L_o-dir are losses for the offset prediction:

$$L_{\mathrm{o\text{-}reg}} = \frac{1}{\sum_{i} m_i} \sum_{i} m_i \left\lVert o_i - \left(\hat{c}_i - p_i\right) \right\rVert_1 \tag{3}$$

$$L_{\mathrm{o\text{-}dir}} = -\frac{1}{\sum_{i} m_i} \sum_{i} m_i \cdot \frac{o_i}{\left\lVert o_i \right\rVert_2} \cdot \frac{\hat{c}_i - p_i}{\left\lVert \hat{c}_i - p_i \right\rVert_2} \tag{4}$$

where O = {o_1, …, o_N} ∈ R^{N×3} is the set of offset vectors for the N points, m = {m_1, …, m_N} is a binary mask, ĉ_i is the centroid of the instance that point i belongs to, and Coords = {p_i} is the point coordinate set. For points of the same instance, we constrain their learned offsets with an L1 loss (Eq. 3). A direction loss (Eq. 4) constrains the direction of the predicted offset vectors to ensure that the points move toward their instance centroids.
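Under the definitions above, Eqs. 3 and 4 could be implemented as follows; this is a sketch under our assumptions about tensor layout (per-point centroid targets and a precomputed binary mask), not the authors' code.

```python
import torch

def offset_losses(offsets, coords, centroids, mask, eps=1e-8):
    """Offset regression (Eq. 3) and direction (Eq. 4) losses.

    offsets:   (N, 3) predicted offset vectors O
    coords:    (N, 3) original point coordinates p_i
    centroids: (N, 3) ground-truth instance centroid c_i of each point
    mask:      (N,) binary mask m_i selecting valid instance points
    """
    target = centroids - coords                   # ground-truth shift vectors
    m = mask.float()
    denom = m.sum().clamp(min=1.0)
    # L1 regression loss pulling each offset toward its target shift
    l_reg = (m * (offsets - target).abs().sum(dim=1)).sum() / denom
    # cosine direction loss aligning predicted and target shift directions
    o_dir = offsets / (offsets.norm(dim=1, keepdim=True) + eps)
    t_dir = target / (target.norm(dim=1, keepdim=True) + eps)
    l_dir = -(m * (o_dir * t_dir).sum(dim=1)).sum() / denom
    return l_reg, l_dir
```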

Plant organ segmentation inference
After obtaining the trained plant stem-leaf and leaf instance segmentation models, we fed the raw plant point clouds into them to obtain the plant organ segmentation results.

The network training, testing, and evaluation
In our study, the performance of plant stem-leaf segmentation was evaluated using 5 quantitative metrics: precision, recall, F1 score, mean intersection over union (mIoU), and the IoU per class. These metrics are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$IoU = \frac{TP}{TP + FP + FN}, \qquad mIoU = \frac{1}{n}\sum_{i=1}^{n} IoU_i$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, and n is the number of label categories; mIoU is calculated by averaging the IoU over all classes. The performance of plant leaf instance segmentation was evaluated by average precision (AP). In our experiments, AP@25 and AP@50 denote AP scores with the IoU threshold set to 25% and 50%, respectively. AP averages the scores over IoU thresholds from 50% to 95% in steps of 5% [37].
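For concreteness, the sketch below computes these semantic metrics from predicted and ground-truth label arrays; it is a straightforward reading of the definitions above (macro-averaged over classes), with the function name being our own.

```python
import numpy as np

def semantic_metrics(pred, gt, num_classes):
    """Per-class IoU plus precision, recall, F1, and mIoU from label arrays."""
    ious, precisions, recalls = [], [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        ious.append(tp / max(tp + fp + fn, 1))
        precisions.append(tp / max(tp + fp, 1))
        recalls.append(tp / max(tp + fn, 1))
    p, r = np.mean(precisions), np.mean(recalls)
    f1 = 2 * p * r / max(p + r, 1e-12)
    return {"precision": p, "recall": r, "F1": f1,
            "mIoU": np.mean(ious), "IoU_per_class": ious}
```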

Phenotypic trait extraction and evaluation
After plant organ segmentation, 3 plant phenotypic traits (stem diameter, leaf width, and leaf length) were extracted (Fig. 1C). For the stem traits, we used the stem points from the stem-leaf segmentation results to calculate the stem diameter. First, we separated the stem points into 4 uniform parts along the z axis. Then, we fitted a straight-line segment on the part of the stem points with the minimum z values using the least squares method. Last, we computed the projection distances from these stem points to this straight line and took twice the median of these distances as the stem diameter [38]. Leaf length and width were calculated from the leaf instance segmentation results. First, we computed the first, second, and third principal component axes of the individual leaf points using principal component analysis (PCA). We found 2 end points along the first axis, and the leaf length was obtained as the shortest path between these 2 end points. Second, we divided the leaf points into 5 parts along the first principal component vector. Then, in each part, we found 2 end points along the second and the third principal component vectors, respectively, and computed the shortest path between each pair of end points; the longest of these paths was taken as the leaf width [38].
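The sketch below illustrates the geometric core of these measurements; it is a simplification under stated assumptions: straight-line extents stand in for the paper's shortest (geodesic) paths, and a PCA line fit stands in for the explicit least-squares fit of the stem axis.

```python
import numpy as np

def stem_diameter(stem_pts):
    """Twice the median radial distance to a line fitted on the lowest
    quarter of the stem points along z (PCA line fit as a simplification)."""
    z = stem_pts[:, 2]
    part = stem_pts[z <= np.quantile(z, 0.25)]
    center = part.mean(axis=0)
    _, _, vt = np.linalg.svd(part - center, full_matrices=False)
    axis = vt[0]                                  # fitted line direction
    rel = part - center
    radial = rel - np.outer(rel @ axis, axis)     # component normal to line
    return 2.0 * np.median(np.linalg.norm(radial, axis=1))

def leaf_length_width(leaf_pts):
    """Leaf length/width from PCA axes; straight-line extents stand in
    for the geodesic shortest paths used in the paper."""
    centered = leaf_pts - leaf_pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj1 = centered @ vt[0]
    length = proj1.max() - proj1.min()            # extent along the 1st axis
    width, edges = 0.0, np.linspace(proj1.min(), proj1.max(), 6)
    for lo, hi in zip(edges[:-1], edges[1:]):     # 5 parts along the 1st axis
        part = centered[(proj1 >= lo) & (proj1 <= hi)]
        if len(part) < 2:
            continue
        for axis in (vt[1], vt[2]):               # 2nd and 3rd axes
            p = part @ axis
            width = max(width, p.max() - p.min())
    return length, width
```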
The accuracy of the phenotypic trait extraction was calculated by the correlation coefficient (R²) and root mean square error (RMSE):

$$R^2 = 1 - \frac{\sum_{l=1}^{m}\left(e_l - e'_l\right)^2}{\sum_{l=1}^{m}\left(e_l - \bar{e}\right)^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{l=1}^{m}\left(e_l - e'_l\right)^2}$$

where e_l and e'_l are the ground truth and the prediction of the plant phenotypic trait, ē is the mean of the ground truth, and m denotes the number of objects to be compared.
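These two metrics are straightforward to compute from paired measurements; a short numpy sketch:

```python
import numpy as np

def r2_rmse(ground_truth, prediction):
    """R^2 and RMSE between measured and extracted trait values."""
    e = np.asarray(ground_truth, dtype=float)
    e_pred = np.asarray(prediction, dtype=float)
    ss_res = np.sum((e - e_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((e - e.mean()) ** 2)          # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((e - e_pred) ** 2))
    return r2, rmse
```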

Plant stem-leaf segmentation
The stem-leaf segmentation results were assessed qualitatively and quantitatively. Figures 8 and 9 present representative stem-leaf segmentation results for soybean and tomato plants at different growth days, respectively. Overall, these results showed that the proposed Eff-3DPSeg exhibited good generalization ability and accuracy for 3D plant shoot segmentation using weak supervision. From the qualitative results, it was observed that the semantic segmentation performance of all weak supervision settings was similar to the full supervision results. However, there were still some falsely classified points in the results. For the soybean plants, the misclassifications happened at the connections between a stem and a leaf and at the edges of leaves, as indicated in the red zoomed-in boxes in Fig. 8. In zoomed-in box a, some points at the junctions of leaves and stem were misclassified as the category "leaf". In zoomed-in box b, some points at the edge of a green leaf were misclassified as the category "stem". Generally, a tomato plant has a more complex shoot structure and more leaves than a soybean plant in this study. Our method still showed good segmentation results on the tomato samples under both the weakly supervised settings and the fully supervised setting (Fig. 9). However, we observed similar misclassification situations. For example, it was difficult to distinguish the exact junction between a leaf and a stem. As shown in zoomed-in box b (Fig. 9), some stem points were falsely classified as leaf points. Especially in the 50-point setting, some points of the main stem were even misclassified as leaf points in zoomed-in box e; in the other training scenarios, the segmentation accuracy for the main stem was high. In boxes c and d (Fig. 9), there were some gaps at the edge of the leaf; here, the points of the leaf edge were falsely classified as stem points in the 200- and 100-point settings.

Fig. 10. Qualitative visualizations of weakly supervised soybean leaf instance segmentation. The samples are at different growth stages. The 5 rows show the leaf instance segmentation ground truth and the results of the different supervision settings, respectively.
The quantitative results are summarized in Tables 2 and 3. "Baseline" means that the plant organ segmentation model was trained without any self-supervised representation learning pretrained weights. In a nutshell, Eff-3DPSeg outperformed the baseline by large margins in all quantitative metrics for both soybean and tomato plants. This demonstrates that our self-supervised pretraining method obtained meaningful representations from the plant point clouds, which significantly benefited the weakly supervised plant organ segmentation. An interesting observation in the quantitative results was that as the number of supervised points decreased, the margins grew larger. Meanwhile, we noticed that Eff-3DPSeg had better segmentation performance on simpler plant structures: the performance on leaves was better than on stems, and the soybean stem-leaf segmentation results were better than those of tomato. The reasons could be that (a) the number of leaf points is larger than that of stem points in the training setting and the stem spatial structure is more complex than that of the leaf and (b) the training data for soybean plants are richer than those for tomato plants. For network training, the amount of training data directly affects the segmentation performance. In Table 2, our method achieved improvements in soybean stem-leaf segmentation performance in all 4 metrics. However, for IoU and precision, the values of the 50-point supervision setting were larger than those of the 200-point setting. Additionally, the performance of the weakly supervised settings was lower than that of the fully supervised setting of Eff-3DPSeg, which is reasonable because the fully supervised setting extracts more features for the segmentation.

Fig. 11. Qualitative visualizations of weakly supervised tomato leaf instance segmentation. The selected tomato samples are at different growth stages. The tomato leaf instance segmentation ground truth and the results of the different supervision settings are shown in different rows.
In addition, we compared our method with several commonly used point cloud segmentation methods under the full supervision setting, including PointNet [39], PointNet++ [40], and PVCNN [41] (Tables 2 and 3). Our method achieved the best performance among these methods: the mean precision, recall, F1 score, and IoU were all about 1 percentage point higher than those of the second-best method for both the soybean and tomato datasets.

Plant leaf instance segmentation
The leaf instance segmentation performance for soybean and tomato plants at different growth days using different numbers of supervision points was also assessed qualitatively and quantitatively. Figures 10 and 11 present representative results of plant leaf instance segmentation. As with the stem-leaf segmentation, it was observed that the networks trained with weak supervision points achieved nearly the same performance as the fully supervised setting. However, because of dense leaves and limited training samples, misclassifications also happened on points at the edges of leaves and the connections between a stem and a leaf. For example, in zoomed-in box a in Figs. 10 and 11, some points at the edges of leaves were falsely classified. Due to the gaps in soybean leaves, individual leaves were incorrectly clustered into several parts, as shown in boxes a, b, and c (Fig. 10); in box b, some points of the stem were even clustered as part of a leaf. For tomato (Fig. 11), 3 close leaves were segmented as the same leaf under the 100- and 50-point settings in boxes d and f. For box e, the edge points of the leaf were misclassified under the 100-point setting, and one leaf was clustered into several parts under the 50-point supervision in box g. These were correct in the 200-point setting, except for points of the top stem, which were falsely clustered as a leaf in box b. This demonstrates that the performance of the leaf instance segmentation improves as the number of supervision points increases. Table 4 summarizes the quantitative results of the soybean and tomato leaf instance segmentation. Compared with the baseline, as the number of supervision points decreased, the performance margins between Eff-3DPSeg and the baseline increased. However, we noticed that the weakly supervised results of the tomato leaf instance segmentation in the 200-point setting declined in AP@50 and AP@25 but increased in AP compared with full supervision. The reason may be that the amount of tomato training data (only 55 point clouds) is not enough, leading to a trained model with poorer statistical properties. In all supervision settings, the segmentation performance of tomato was generally better than that of soybean. Since plant instance segmentation is a more challenging problem than stem-leaf semantic segmentation due to the inherent variation in appearance and structure, we observed a bigger margin between the weakly supervised settings and the fully supervised setting for leaf instance segmentation than for stem-leaf segmentation. However, the accuracy increased as the number of annotated points increased, indicating that more ground truth information from the point clouds is needed to provide instance supervision. Meanwhile, we also compared our method with other point cloud segmentation methods under full supervision with the same experiment settings, including PointNet and PVCNN. Our method achieved the best performance among these methods, with large margins of over 30 percentage points in AP@50.

Evaluation of extracted traits
The accuracy of the extracted plant phenotypic traits was evaluated with the correlation coefficient R² and the RMSE (Tables 5 and 6). We selected 9 soybean point clouds and 11 tomato point clouds and compared the manual measurements with the traits extracted from the segmentation results. Generally, our method achieved good performance for extracting plant organ-level phenotypic traits based on the proposed 3D deep learning plant organ segmentation method. However, there were opposite trends for the soybean and tomato datasets. For the tomato plants, the best performance was achieved with the 200-point setting; in contrast, the best performance for the soybean plants was achieved with the 50-point setting. That is because the accuracy of trait extraction depends on the performance of the plant organ segmentation; these trends match the plant organ segmentation results described in the "Plant stem-leaf segmentation" and "Plant leaf instance segmentation" sections. At the stem level, the stem diameter was extracted from the results of the stem-leaf segmentation. Owing to the high performance of the tomato and soybean stem-leaf segmentation, good accuracy was achieved for both types of plants in terms of the 2 evaluation metrics. However, the stem diameter R² and RMSE for the tomato plants were better than those for the soybean plants. That is because ghost noisy points in the soybean point clouds affected the performance of the trait extraction. As shown in Fig. 12, zoomed-in area 3 shows part of a soybean stem, in which many ghost points lie to the left of the stem. In contrast, the resolution of the tomato stem was very high, and the details of the stem are displayed clearly in zoomed-in area 1 (Fig. 12). The results for leaf width and length depended on the leaf instance segmentation. The performance of the extracted leaf phenotypic traits for the tomato plants was better than that for the soybean plants. In addition, there were some gaps in the soybean leaves (zoomed-in area 2 in Fig. 12), which affected the accuracy of the extracted leaf length and width.

Plant organ segmentation
Overall, the proposed weakly supervised framework Eff-3DPSeg achieved promising plant organ segmentation performance for soybean and tomato plants at different growth days, and the performance was close to that of the fully supervised setting (Tables 2 to 6). It was observed that the misclassifications mainly happened on the top sprouts of plants, the junctions of stems and leaves, and the edges of leaves. This could be caused by the following reasons. First, in our experiments, we randomly chose 50, 100, and 200 points from the point clouds for weak annotation, and the selected points may not be the best subset for the weakly supervised segmentation tasks. Second, the training dataset is not big enough [35]; more data under different treatments will be included in the future. These reasons can result in a nonuniform distribution of labeled points that fails to cover the representative points of the plants.
The plant leaf instance segmentation network can obtain the stem-leaf and leaf instance segmentation simultaneously (Fig. 7). However, the stem-leaf segmentation performance obtained from the leaf instance segmentation network (Table 7) was not as good as that obtained from the network dedicated to the stem-leaf segmentation task (Tables 2 and 3). This is because the leaf instance segmentation network makes a trade-off between stem-leaf segmentation and leaf instance segmentation: the weight of the semantic segmentation part of the loss function is only 1/3. In addition, the clustering part of the instance segmentation depends on the outputs of the 2 branches (Fig. 7); if the stem-leaf segmentation performs poorly, it affects the final leaf instance segmentation results. Hence, we will optimize the leaf instance segmentation framework. First, we will fuse the outputs of the 2 branches to carry out effective information interaction between the semantic and instance feature maps. Then, we will improve the loss function of the stem-leaf and leaf instance segmentation to achieve high semantic and instance segmentation performance simultaneously.
PlantNet [25], a similar 3D plant organ segmentation work, developed a fully supervised deep learning network for plant semantic and instance segmentation. The network requires the input point cloud to have a fixed number of points (4,096), and the input point cloud only contains the XYZ 3D coordinates. Another fully supervised plant organ segmentation network, PSegNet [42], used a voxelized FPS method to down-sample plant data to point clouds of 4,096 points. In contrast, our method is flexible with respect to the dimension of the input point cloud. For example, our soybean dataset includes both XYZ 3D coordinates and RGB color information, whereas the Pheno4D dataset only contains 3D coordinate information (i.e., tomato). The ability to accept higher-dimensional input could benefit plant organ segmentation by fusing new features; for example, we may fuse thermal or multispectral information into the point clouds in the future, and the new features could be useful for the segmentation tasks. Moreover, our method does not require the input point cloud to have a fixed number of points. Our network can handle point clouds with more than 100,000 points, which is particularly useful for processing plants with large shoots, such as maize plants and trees. Additionally, our weakly supervised plant organ segmentation framework, Eff-3DPSeg, only needs around 0.5% of the points to be labeled, which can dramatically reduce annotation time.
As shown in Table 8, we compared feeding only coordinate information versus coordinate plus color information into the fine-tuning network. When only coordinate information was used as the network input, the performance of soybean stem-leaf segmentation declined dramatically as the number of supervised points decreased. Hence, color information is important for plant organ segmentation when the pretrained network has learned meaningful information from unlabeled point clouds.

Plant phenotypic trait extraction
Based on the results of plant organ segmentation, we achieved good performance for plant organ phenotypic trait extraction on both datasets. However, the results for the tomato dataset were slightly better than those for our soybean dataset. Besides the difference in plant organ segmentation performance between the 2 datasets, another reason is the quality of the plant point clouds. Our soybean dataset was reconstructed by a low-cost photogrammetry system (MVSP2) with a total cost of around $1,500. In contrast, the Pheno4D (tomato) dataset [31] was built with a light section scanner coupled to an articulated measuring arm, which can generate high-resolution plant point clouds but cost more than $20,000. We will add one more RGB camera to our imaging system to improve the point color quality.

Limitations and future works
Our framework, Eff-3DPSeg, achieved promising performance for 3D plant shoot segmentation. Nevertheless, there are still some limitations in our method. First, some of the soybean data captured using the MVSP2 have gaps (missing points) on leaves (such as the point clouds captured on May 18 and 25 in Fig. 2), which affects the leaf instance segmentation. Additionally, ghost noisy points (as in the point cloud captured on May 16 in Fig. 2) can also affect the trait measurements. In the future, we will develop point cloud preprocessing methods for gap filling and denoising. Second, our deep learning network needed to train the stem-leaf segmentation and leaf instance segmentation separately. Although we can obtain stem-leaf segmentation results during the leaf instance segmentation processing, their performance was worse than that of the model trained directly with the stem-leaf segmentation network. We will improve the framework, such as by optimizing the loss function for the stem-leaf segmentation part, to implement stem-leaf and leaf instance segmentation simultaneously and increase the performance. Third, we provided only 2 categories (stem and leaf) and only early growth stages of plants in this study. In the future, we can classify a plant more finely into more categories, such as leaf, main stalk, branch, petiole, and growing point, based on the desired plant breeding objectives. Also, we will produce more point clouds with various plant species and different growth stages to further test our method, improving its generality and diversity.

Conclusion
In this study, we proposed a novel annotation-efficient deep learning framework, Eff-3DPSeg, for 3D plant shoot segmentation. First, we developed a low-cost multi-view imaging data acquisition platform (MVSP2) and a point cloud annotation tool (MPA) to build a spatiotemporal point cloud dataset for soybean plants. Then, 3 different annotation settings (50, 100, and 200 annotated points) on the soybean dataset and the public dataset Pheno4D were used to train and test the proposed network Eff-3DPSeg. Overall, our method achieved 3D plant organ segmentation performance similar to that of the fully supervised setting, and 3 organ-level phenotypic traits were well extracted. In addition, our method can dramatically reduce point cloud annotation time, and the point cloud reconstruction can be achieved with a low-cost multi-view imaging platform. We believe that this work will contribute to the efficiency of high-throughput plant phenotyping and the development of smart agriculture.